Method and arrangement for pattern recognition on the basis of statistics

ABSTRACT

A method and an arrangement are presented for pattern recognition on the basis of statistics. According to the method, for an object to be recognized, the association with each target class of the class set is estimated with a numerical value produced by cascaded use of polynomial classifiers, on the basis of a complete set of two-class or multiclass classifiers. According to the invention, on a learning sample in which all class patterns to be recognized are sufficiently represented, those two-class or multiclass classifiers are selected, by way of their estimation-vector spectrum, whose estimates contribute the most, with high separating relevance, to minimizing a scalar quantity calculated over the estimation-vector spectrum. The selected two-class or multiclass classifiers are subsequently used to form estimation vectors over an expanded learning sample, from which expanded characteristic vectors are produced by polynomial linking. An evaluation classifier for estimating all target classes is formed on the basis of said characteristic vectors.

BACKGROUND OF THE INVENTION

Pattern recognition is becoming increasingly significant in the era of electronic data processing. Its range of use extends from automation technology to machine-based image and text processing, where it is used for automatic letter distribution (address reading) or to evaluate forms or documents. The objective of pattern recognition is to allocate to a piece of electronically pre-processed image information an identification that reliably coincides with the true meaning of the pattern. Statistics-based pattern-recognition methods assess a digitized piece of image information with estimates from which the degree of association of the pattern with a class of patterns can be read. With K given target classes, the class whose estimation result corresponds to the maximum of all K estimates is generally awarded this assessment. A recognition system is more reliable the more frequently the target class estimated as the maximum class matches the true target class (meaning). A network classifier used to this point, which comprises a complete ensemble of two-class classifiers and has the task of discriminating the K target classes, is based on the fact that a two-class classifier is calculated for each of the possible K*(K−1)/2 class pairs. During a reading operation, each of the two-class classifiers supplies, for the present pattern, an estimate of the association of the pattern with one of its two underlying target classes. The result is K*(K−1)/2 estimates, which are not independent among themselves. From these K*(K−1)/2 estimates, K estimates are to be formed, one for each target class. The theory provides a mathematical rule for this relationship, which is described in Wojciech W. Siedlecki, A formula for multiclass distributed classifiers, Pattern Recognition Letters 15 (1994). The practice of classifiers demonstrates that the applicability of this rule is unsatisfactory, because the two-class classifiers supply no statistical conclusion probabilities as soon as they estimate a foreign pattern that is not part of their adapted class range. In practice, this means that shutoff mechanisms must deactivate those classifiers that are not responsible for the pattern as early as possible. The shutoff rules used to this point in practice are largely of a heuristic nature. Consequently, an element of arbitrariness that is not statistically controlled is factored into the processing of network classifiers. This rule-based iteration of variables that exhibit a measurable statistical behavior significantly worsens the recognition results. Rule-based iteration of network classifiers additionally prevents the possibility of effectively re-training the classifier system when the samples are changed. With 30 or more classes to be discriminated, the use of network classifiers also meets with fundamental problems:

1. The number of components (pair classifiers) to be stored increases quadratically with the number of classes (K*(K−1)/2).

2. An assessment and combination of the component-related estimates into a reliable total estimate becomes increasingly unreliable with a growing number of classes.

3. Adaptations of a network classifier to country-specific writing styles incur considerable costs in the adaptation phase.

OBJECTS AND SUMMARY OF THE INVENTION

The object of the invention is to create a statistics-based pattern-recognition method, and an arrangement for executing the method, with which the outlined problems of the state of the technology are avoided with a large class number and for justifiable costs, and which is capable of performing general recognition tasks in real time while avoiding a rule-based iteration of network classifiers.

A particular advantage attained with the invention is a significant increase in recognition reliability through the avoidance of heuristic shutoff rules. Following the selection of the two- or multi-class classifiers and the generation of the assessment classifier, the entire set of statistics of the application is represented in the moment matrix used as the basis of the assessment classifier. Because only the reduced sequence of the two- or multi-class classifiers need be stored together with the assessment classifier, the memory load is very economical. Because polynomial classifiers manage all operations following the image-feature preparation by means of addition, multiplication and ordering of natural numbers, more complex calculations, for example floating-point simulation on the target hardware, are completely omitted. Memory-hogging table-creation mechanisms can also be omitted. These circumstances permit the design of the target hardware to focus on optimizing the transit time. In one embodiment of the invention, the sum of the error squares in the discrimination space is selected as a scalar separation measure (classification measure). The advantage of this is that, over the course of the linear regression accompanying the calculation of the linear classifier, a ranking of the components by their contribution to minimizing the residual error is explicitly formed. This ranking is utilized in selecting from the available two- or multi-class classifiers, which selection forms the reduced set of pair classifiers as a result. The method of minimizing the residual error is described in detail in Schürmann, Statistischer Polynomklassifikator [Statistical Polynomial Classifier], R. Oldenbourg Verlag [Publisher], 1977.

Another embodiment uses the entropy in the distribution space of the estimation vectors as a scalar separation measure. To evaluate the entropy, the frequency of appearance of each state of all pair-classifier estimates over the quantity of all states must be determined. Then, the partial system that produces the least entropy is determined. In a further embodiment, a large quantity of target classes is divided into numerous smaller quantities of target classes, for each of which a selection of the two- or multi-class classifiers is made and, from this, an assessment classifier is generated. A resulting total estimate is then determined from the results of the assessment classifiers. The resulting total estimate can be calculated in various ways:

1. In a first embodiment, a Cartesian-expanded product vector is formed from the result vectors of the assessment classifiers, from which a resulting quadratic assessment classifier is formed that determines the total estimate.

2. A Cartesian product vector is also formed in a second embodiment. By means of a subspace transformation U, this vector is converted into a shortened vector, of which only the most crucial components, corresponding to the eigenvalue distribution of the transformation matrix U, are used to adapt a quadratic classifier. This quadratic classifier then maps the transformed and reduced vector onto the target classes for an estimation vector.

3. In another embodiment, a meta-class classifier, which is trained over groups of class quantities, generates estimates for the groups prior to activation of the respective selection of the two- or multi-class classifiers. Afterward, the two- or multi-class classifiers for the characters of the groups whose estimated value lies above an established threshold are activated. To determine the total estimate, the group estimates are linked to the estimates of the respective associated character-assessment classifiers for the character target classes according to a unified rule, such that the sum of all character estimates obtained in this manner yields a number that can be normalized to 1. The first variation yields the most precise results with the most computational effort, while the second and third variations contribute to reducing the computational effort.
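The "Cartesian-expanded product vector" of the first two variations can be illustrated with a minimal sketch. It assumes that the expansion consists of all pairwise products of two result vectors; the exact linking rule is not spelled out in the text, so this is one plausible reading:

```python
import numpy as np

def cartesian_product_vector(D1, D2):
    """Sketch of the Cartesian-expanded product vector of two
    assessment-classifier result vectors: all pairwise products
    D1[i] * D2[j]. (Assumed reading of the expansion.)"""
    return np.outer(D1, D2).ravel()
```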

BRIEF DESCRIPTION OF THE DRAWINGS

The realization of the invention can be divided into five phases. Each phase of the invention is described in detail in the following section, with reference to a respective drawing. Shown are in:

FIG. 1 the procedure in the generation of a complete set of pair classifiers of a network classifier;

FIG. 2 the procedure in the generation of a reduced set of pair classifiers;

FIG. 3 the procedure in the generation of an assessment classifier;

FIG. 4 the procedure in pattern recognition in an arrangement of the invention; and

FIG. 5 the procedure in pattern recognition with a large number of classes, with the use of a meta-class classifier.

DETAILED DESCRIPTION OF THE INVENTION

The invention is explained below by way of ensembles of two-class classifiers (pair classifiers), but is not limited in principle to this specification.

Generation of a network classifier:

In accordance with FIG. 1, this process step begins with the binary images. An image vector $\vec{u}$ that has been converted into binary form is present for each image from the learning sample. In principle, the black pixels of the image are represented by 1, while the white pixels are encoded with 0. In addition, a human-issued reference identification is created for each pattern; this identification unambiguously encompasses the meaning of the pattern. Normalization transforms the binary image into a gray image based on measurements of the local and global pixel proportions. In the process, the feature vector $\vec{v}$ is formed, which has 256 components, with each component containing gray values from the scaling range [0, 255]. The vectors $\vec{v}$ are subjected to a principal-axis transformation with the matrix $B$. The result of this matrix-vector multiplication is the image vector $\vec{w}$. The image vector $\vec{w}$ is now polynomial-expanded to the vector $\vec{x}$ according to an established imaging rule. For a two-dimensional vector $\vec{w} = (w_1, w_2)$, the linking rule PSL1 = LIN 1,2; QUAD 11,12,22

is used to generate, for example, the following x vector:

$$\vec{x} = (1,\; w_1,\; w_2,\; w_1 w_1,\; w_1 w_2,\; w_2 w_2).$$

The first component is always set in advance to 1 so that the estimated values generated later can be normalized to a sum of 1. Over the polynomial-expanded quantity of vectors $\{\vec{x}\}$, empirical moment matrices $M_{ij}$ are then generated for each pair (i,j) of classes in accordance with Formula (2) below. Class i includes only characters that the person has referenced as belonging to Class i. Each moment matrix is subjected to a regression in accordance with the method described in Schürmann J., Polynomklassifikatoren für die Zeichenerkennung [Polynomial Classifiers for Character Recognition]. In accordance with Formula (1) below, the classifiers $\vec{A}_{ij}$ result for each class pair (i,j). The ensemble of K*(K−1)/2 pair classifiers thus results as the network classifier. Each of these pair classifiers (two-class discriminators) is trained over corresponding pattern data such that it recognizes exactly two of the K available target classes. The following relationships apply:

$$A(i,j)[k] = \sum_l M^{-1}(i,j)[k,l] \cdot \bar{x}(i)[l] \cdot p(i) \qquad (1)$$

$$M(i,j)[k,l] = \frac{1}{z(i,j)} \sum_{\{(i,j)\}} x(z)_k \, x(z)_l \quad \text{(empirical moment matrix of second order)} \qquad (2)$$

z(i,j)—number of characters of the classes (i,j), where $\bar{x}(i)[l]$ is the average-value vector for the feature vector $\vec{x}(i)$ in the component representation,

and p(i) is the frequency of appearance of Class i. For the present feature vector $\vec{x}$, the classifier A(i,j) estimates the value d(i,j) on Class i, where

$$d(i,j) = \sum_l A(i,j)[l] \cdot x[l]. \qquad (3)$$

The classifier coefficients A(i,j)[l] are set such that the following always applies:

d(i, j)+d(j,i)=1.  (4)
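For illustration, the adaptation just described can be condensed into a short sketch: polynomial expansion per the PSL1 rule, moment matrix per Formula (2), regression per Formula (1), and d-vector formation per Formula (3). This is a minimal NumPy illustration with freely chosen names, not the patent's integer-arithmetic target hardware:

```python
import numpy as np

def psl1_expand(w):
    """PSL1 = LIN 1,2; QUAD 11,12,22 for a two-component w vector.
    The leading 1 lets the later estimates be normalized to a sum of 1."""
    w1, w2 = w
    return np.array([1.0, w1, w2, w1 * w1, w1 * w2, w2 * w2])

def train_pair_classifier(X_i, X_j):
    """Pair classifier A(i,j) per Formulas (1)-(2).
    X_i, X_j: rows are polynomial-expanded x vectors of classes i and j."""
    X = np.vstack([X_i, X_j])
    M = X.T @ X / len(X)                      # moment matrix (2)
    p_i = len(X_i) / len(X)                   # frequency of appearance of class i
    x_bar_i = X_i.mean(axis=0)                # average-value vector of class i
    return np.linalg.pinv(M) @ x_bar_i * p_i  # regression result (1)

def estimation_vector(x, pair_classifiers):
    """d vector per Formula (3): one scalar product A(i,j).x per pair
    classifier, entered in the fixed sequence of pair classifiers."""
    return np.array([A @ x for A in pair_classifiers])
```

Because the first component of every expanded vector is 1, the solutions of (1) satisfy Formula (4), d(i,j) + d(j,i) = 1, whenever the moment matrix is invertible.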

Reduction in the number of pair classifiers: This process step is shown schematically in FIG. 2. After a network classifier has been generated, its estimation-vector spectrum {d(i,j)} is recorded over a learning sample that is virtually uniformly distributed by class, and in which all of the class patterns to be recognized are represented sufficiently. Classification first generates polynomial-expanded vectors $\vec{x}(\vec{w})$ from the image vectors $\vec{w}$ that have undergone principal-axis transformation. The spectrum of pair classifiers is the quantity of all d vectors $\{\vec{d}\}$ resulting from the classification of the learning sample with all pair classifiers. A single d vector is formed in that, for each pair classifier $\vec{A}_{ij}$, a scalar product with the expanded image vector $\vec{x}$ of a character of the learning sample is formed, and the results of the scalar products are entered into the vector $\vec{d}$ in the sequence of pair classifiers, as in the sketch above. Statistical operations are performed below on the feature quantity {d(i,j)} to determine the partial system of pair classifiers $\{\vec{A}(i_\alpha, j_\alpha)\}$ that, as a target presetting, minimizes a scalar measure possessing a high separation relevance. The reduction method is explained by way of example for the following two scalar classification measures:

Sum of the error squares S² in the discrimination space $D_I$:

$$S^2 = \frac{1}{N} \sum_{\{z\}} \sum_{I=1}^{K} (D_I - Y_I)^2 \qquad (5)$$

where

$$\exists A^{Lin}_{I,j,k}: \quad D_I = \sum_{j=1}^{K-1} \sum_{k=j+1}^{K} A^{Lin}_{I,j,k} \cdot d(j,k) \qquad (6)$$

$D_I$—estimate of the classifier on Class I,

$$Y_I = \begin{cases} 1 & \text{if } z \in \{I\} \\ 0 & \text{otherwise.} \end{cases}$$

The classifier $A^{Lin}$ is generated according to Equation (1). It is linear because the d vector $\vec{d}$, constructed from the ordered pair estimates d(i,j) and expanded by 1 in the first place, is used as the x vector in calculating the moment matrix according to Equation (2). An advantage of this variation is that, over the course of the linear regression accompanying the calculation of the linear classifier $A^{Lin}_{I,j,k}$, a ranking according to the extent of the contribution to minimizing the residual error is explicitly formed among the components d(i,j). This ranking is used to make a selection from the K*(K−1)/2 available pair classifiers, which forms the reduced set of pair classifiers as a result. The method of (successively) minimizing the residual error need not be discussed in detail here; it is described in detail in Schürmann J., Polynomklassifikatoren für die Zeichenerkennung [Polynomial Classifiers for Character Recognition], Section 7. The selection is limited to the 75 highest-ranked components. This limitation is a function of the primary storage capacity of the calculating machine used for adaptation, and thus does not represent a fundamental limitation. The result of the linear regression over the feature quantity {d(i,j)} is therefore an ordered quantity $\{A(i_\alpha, j_\alpha)\}_{\alpha=1,\ldots,75}$ of pair classifiers, which offer a decreasing relative contribution to reducing the quadratic residual error as the index α increases. Only this selected quantity of pair classifiers continues to be used over the further course of adaptation.
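As an illustration of this ranking, the following sketch performs a greedy forward selection over the d components: at each step it adds the component that most reduces the quadratic residual error (5) of the linear classifier (6). It is a brute-force stand-in for the efficient successive regression of Schürmann's Section 7; names and the matrix layout are assumptions:

```python
import numpy as np

def rank_pair_classifiers(D, Y, max_components=75):
    """Greedy ranking of d components by contribution to minimizing the
    residual error (5) under the linear model (6).
    D: (N, P) matrix, one column per pair estimate d(j,k) over the sample.
    Y: (N, K) target matrix, Y[z, I] = 1 if pattern z belongs to class I.
    Returns column indices in decreasing order of contribution."""
    ones = np.ones((D.shape[0], 1))           # the constant-1 first place
    selected, remaining = [], list(range(D.shape[1]))
    for _ in range(min(max_components, D.shape[1])):
        best, best_err = None, np.inf
        for p in remaining:
            cols = np.hstack([ones, D[:, selected + [p]]])
            A, *_ = np.linalg.lstsq(cols, Y, rcond=None)  # linear regression
            err = np.mean((cols @ A - Y) ** 2)            # residual error (5)
            if err < best_err:
                best, best_err = p, err
        selected.append(best)
        remaining.remove(best)
    return selected
```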

Entropy in the d(i,j) distribution space: To form the entropy H of the unstructured ensemble of two-class discriminators, it is necessary to calculate the following expressions:

$$H = \sum_{\{i,j\}} H_{i,j} \qquad (7)$$

$$H_{i,j} = -\sum_I \sum_{\{d(i,j)\}} p(I \mid d(i,j)) \cdot \log_2 p(I \mid d(i,j)) \cdot p(d(i,j)) \qquad (8)$$

The expression p(I | d(i,j)) is, per definition, the Bayes posterior probability that the classifier A(i,j) has classified a Class I pattern, given the estimate d(i,j). With use of the Bayes formula:

$$p(I \mid \alpha) = \frac{p(\alpha \mid I) \cdot p(I)}{\sum_{J=1}^{K} p(\alpha \mid J) \cdot p(J)} \qquad (9)$$

and the following definition:

$N_{i,j,\alpha,J}$: number of Class J characters for which the following applies:

$$UR + \frac{OR - UR}{MAX}(\alpha - 1) \le d(i,j) < UR + \frac{OR - UR}{MAX}\,\alpha \qquad (10)$$

$$\alpha = 1, \ldots, MAX, \qquad (11)$$

it becomes possible to approximate the entropy H with arbitrary precision through an empirical entropy H*. The parameters:

UR:=lower estimate threshold

OR:=upper estimate threshold

MAX:=number of histogram segments

are determined on the basis of the estimation-vector spectrum. The values $N_{i,j,\alpha,J}$ are then incremented over the learning sample of patterns. After the $N_{i,j,\alpha,J}$ have been determined, the following values can be calculated:

$$H^* = \sum_{\{i,j\}} H^*_{i,j} \qquad (12)$$

$$H^*_{i,j} = \frac{-1}{N} \sum_{\alpha=1}^{MAX} \sum_{I=1}^{K} N_{i,j,\alpha,I} \cdot \log_2 \frac{N_{i,j,\alpha,I}}{\sum_{J=1}^{K} N_{i,j,\alpha,J}} \qquad (13)$$

$$V^*_{i,j}[I][\alpha] := \begin{cases} -N_{i,j,\alpha,I} \cdot \log_2 \dfrac{N_{i,j,\alpha,I}}{\sum_{J=1}^{K} N_{i,j,\alpha,J}} & 1 \le \alpha \le MAX \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

$$DOT(i,j;k,l) := \sum_{J=1}^{K} \sum_{\alpha=1}^{MAX} V^*_{i,j}[J][\alpha] \cdot V^*_{k,l}[J][\alpha] \qquad (15)$$

$$COS(i,j;k,l) := \frac{DOT(i,j;k,l)}{\sqrt{DOT(i,j;i,j) \cdot DOT(k,l;k,l)}} \qquad (16)$$

$$ANG(i,j;k,l) = \arccos\, COS(i,j;k,l) \qquad (17)$$

$$-MAX \le \beta \le +MAX \qquad (18)$$

$$COR^+(i,j;k,l) := \max_\beta \frac{\sum_{J=1}^{K} \sum_{\alpha=1}^{MAX} V^*_{i,j}[J][\alpha] \cdot V^*_{k,l}[J][\alpha - \beta]}{\sqrt{DOT(i,j;i,j) \cdot DOT(k,l;k,l)}} \qquad (19)$$

$$COR^-(i,j;k,l) := \max_\beta \frac{\sum_{J=1}^{K} \sum_{\alpha=1}^{MAX} V^*_{i,j}[J][\alpha] \cdot V^*_{k,l}[J][MAX - \alpha - \beta]}{\sqrt{DOT(i,j;i,j) \cdot DOT(k,l;k,l)}} \qquad (20)$$

$$COR(i,j;k,l) := \max\{COR^+(i,j;k,l);\; COR^-(i,j;k,l)\} \qquad (21)$$
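A minimal sketch of the empirical entropy of a single pair classifier, per Formulas (10)-(13), assuming NumPy arrays and integer class labels (names and layout are illustrative):

```python
import numpy as np

def empirical_pair_entropy(d_values, labels, UR, OR, MAX, K):
    """H*_{i,j} per Formula (13), built from the histogram N_{i,j,alpha,I}
    defined by the segment rule (10)-(11).
    d_values: estimates d(i,j) of one pair classifier over the sample.
    labels:   true class indices 0..K-1 of the same patterns."""
    N = np.zeros((MAX, K))
    width = (OR - UR) / MAX
    for d, I in zip(d_values, labels):
        alpha = int((d - UR) / width)        # histogram segment per (10)
        if 0 <= alpha < MAX:
            N[alpha, I] += 1
    H = 0.0
    for alpha in range(MAX):
        total = N[alpha].sum()               # sum over J of N_{i,j,alpha,J}
        for I in range(K):
            if N[alpha, I] > 0:
                H -= N[alpha, I] * np.log2(N[alpha, I] / total)
    return H / len(d_values)                 # the 1/N prefactor of (13)
```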

After the calculation has been made, all statistical values connected with the entropy are determined. Now the partial system must be determined

1. whose sum of individual entropies produces a minimal total entropy, and

2. whose components have the statistically smallest-possible correlations among themselves.

These two conditions are met by adhering to the following selection criteria:

1. Entropy ranking:

 Establish the sequence $A(i_\alpha, j_\alpha)$, which is unambiguous up to the classifiers having identical entropy, with:

$$H^*_{i_1,j_1} \le H^*_{i_2,j_2} \le \ldots \le H^*_{i_n,j_n} \qquad (22)$$

2. Start of induction:

 Select $A(\bar{i}_1, \bar{j}_1) := A(i_1, j_1)$ as the starting candidate. This is the classifier having minimal entropy.

3. Induction step from k to k+1:

 Here, k candidates $A(\bar{i}_1, \bar{j}_1), \ldots, A(\bar{i}_k, \bar{j}_k)$ have already been selected, where

$$H^*_{\bar{i}_1,\bar{j}_1} \le \ldots \le H^*_{\bar{i}_k,\bar{j}_k} := H^*_{i_m,j_m} \qquad (23)$$

and

$$\exists\, m: \quad A(i_m, j_m) = A(\bar{i}_k, \bar{j}_k); \quad k \le m < n \qquad (24)$$

and

$$ANG(\bar{i}_\alpha, \bar{j}_\alpha;\; \bar{i}_\beta, \bar{j}_\beta) > \Theta_{crit}; \quad 1 \le \alpha < \beta \le k. \qquad (25)$$

Determine a smallest l with

$$l = 1, \ldots, n - m \qquad (26)$$

such that

$$ANG(\bar{i}_\alpha, \bar{j}_\alpha;\; i_{m+l}, j_{m+l}) > \Theta_{crit}; \quad \alpha = 1, \ldots, k \qquad (27)$$

If such an l exists, select:

$$A(\bar{i}_{k+1}, \bar{j}_{k+1}) := A(i_{m+l}, j_{m+l}) \qquad (28)$$

Otherwise, terminate with k candidates. The free angle parameter $\Theta_{crit}$ can be set through gradient methods so as to yield a defined number of selected components. The selection criterion can be refined through minimization of the correlation coefficient COR(i,j;k,l). In this case, the filter condition (25) becomes:

$$COR(\bar{i}_\alpha, \bar{j}_\alpha;\; \bar{i}_\beta, \bar{j}_\beta) < \chi_{crit}; \quad 1 \le \alpha < \beta \le k. \qquad (29)$$

The maximum permitted correlation coefficient between two respective pair classifiers is thus limited by $\chi_{crit}$. The correlation coefficient further meets the symmetry conditions:

COR(i,j;k,l)=COR(j,i;k,l)  (30)

COR(i,j;k,l)=COR(i,j;l,k)  (31)

COR(i,j;k,l)=COR(k,l;i,j)  (32)
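The induction of steps 1-3 amounts to a greedy walk along the entropy ranking that only accepts classifiers sufficiently decorrelated from all previously accepted ones. A compact sketch, assuming the entropies and the ANG function of (17) are precomputed:

```python
def select_partial_system(H_star, ANG, theta_crit):
    """Greedy selection per induction steps 1-3.
    H_star:     dict mapping a class pair (i, j) to its entropy H*_{i,j}.
    ANG:        function ANG(pair_a, pair_b) per Formula (17).
    theta_crit: free angle parameter Theta_crit of condition (27)."""
    ranking = sorted(H_star, key=H_star.get)       # entropy ranking (22)
    selected = [ranking[0]]                         # start of induction
    for candidate in ranking[1:]:
        # accept only if angled away from all selected candidates (27)
        if all(ANG(candidate, s) > theta_crit for s in selected):
            selected.append(candidate)
    return selected
```

The refined variant simply replaces the ANG test with the COR test of condition (29), with the inequality reversed.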

The method is embodied for the aforementioned classification measures, but is not limited to them in principle. Other embodiments of the method may be based on other classification measures.

Generation of an assessment classifier: This process step is illustrated in FIG. 3. After the reduced sequence of pair classifiers $\{\vec{A}_{i_\alpha j_\alpha}\}$ has been determined (identified in FIG. 3 as RNK), it is used over an expanded learning sample that contains at least 9,000 patterns per discrimination class to calculate estimate vectors. The resulting feature quantity $\{d(i_\alpha, j_\alpha)\}$ comprises individual vectors of dimension 75, with the target identification (pattern meaning) for each pattern being entered synchronously from the quantity of image features. A quadratic expansion to a polynomially linked vector $\vec{x}$ is effected over each feature vector. Then a moment matrix $M_{(I)}$ is generated for each Class I on the basis of the vectors $\vec{x}$. A total moment matrix $M$ is generated through weighted averaging over all class-wise moment matrices $M_{(I)}$. A regression then generates the class-wise assessment classifiers according to the following formula:

$$\vec{A}_I = M^{-1} \cdot \bar{x}(I) \cdot P(I) \qquad (33)$$

where P(I) is the frequency of appearance of Class I.
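A condensed sketch of this adaptation step under the same NumPy assumptions as above; the class-wise moment matrices are folded into the total matrix by their sample weights:

```python
import numpy as np

def train_assessment_classifier(X, labels, K):
    """Class-wise assessment classifiers per Formula (33).
    X:      (N, L) quadratically expanded estimate vectors (first component 1).
    labels: array of class indices 0..K-1, one per pattern."""
    M = X.T @ X / len(X)        # total moment matrix (weighted class average)
    M_inv = np.linalg.pinv(M)
    A = np.empty((K, X.shape[1]))
    for I in range(K):
        X_I = X[labels == I]
        P_I = len(X_I) / len(X)                 # frequency of appearance P(I)
        A[I] = M_inv @ X_I.mean(axis=0) * P_I   # Formula (33)
    return A    # estimate vector: D = A @ x, one component per class
```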

Hence, without further hypotheses about the relationship between pair-classifier estimates and the total estimate for the individual classes, an instrument is created that calculates this relationship from the approximate statistical distribution in the d(i,j) space. The more general the underlying learning sample, the better the computational approximation of the optimum relationship between pair estimates and the total estimate. The successes of the method demonstrate, in particular, that the measure of arbitrariness in the selection of suitable learning samples is more limited than the measure of arbitrariness that is a given in the construction of a hypothetical mathematical relationship. The adaptation phase ends with the generation of a reduced set of pair classifiers and the assessment classifier coupled to this set. The total statistics of the application are now represented in the moment matrix underlying the assessment classifier. This moment matrix is administered in the archives for subsequent adaptation procedures. For integration into the product, only the reduced sequence of pair classifiers is associated with an appropriate assessment classifier. The method thus permits the generation of extremely compressed units of information (moment matrices), which represent a given application of pattern recognition and can be used for future iterations (country-specific adaptations).

Use of the method of the invention in a corresponding arrangement:

This process step is shown schematically in FIG. 4. For use in a real-time system, the method makes available the ordered, reduced sequence of pair classifiers and an assessment classifier appropriate for the respective application. With insufficient knowledge of which assessment classifier provides the best results for the application, the selection can be increased, and a reading test verifies the optimum assessment classifier. The minimum requirement on a target hardware is dictated by the presence of the following components:

1. NORM: A module that employs the normalizing methods of the state of the technology to transform the feature vectors that have been converted into binary form, and places them in an input vector of constant length (v vector).

2. POLY: A module that, in accordance with an established imaging rule, transforms a normalized w vector into a polynomial-expanded x vector that serves as the input vector for classification.

3. MAT: A matrix-multiplication network that, with external micro-program control, calculates scalar products between integer vectors.

4. STORE: A memory for storing intermediate values and for addressingthe classifier coefficients.

5. ALGO: A command register for storing control commands or machine codes to be executed.

The memory is partitioned into a unit STORE1 for storing and reading intermediate results and a unit STORE2 for reading unchangeable values.

The operations illustrated in FIG. 4 are necessary for the described method. The following steps are performed consecutively, with control by the command register ALGO (a functional sketch of the complete loop follows the list):

1. The recognition process begins with the reading of the pixel image. After the pattern has been sampled, it is stored in STORE1 as a binary image vector $\vec{u}$. In principle, a black pixel corresponds to the binary 1 and a white pixel corresponds to 0. NORM now groups the binary image elements on the basis of measurements of the pixel densities by column and line such that the result is a gray-image vector $\vec{v}$, which corresponds to a 16×16 image matrix. Each element of the gray image is scaled into 256 stages of gray. NORM writes the vector $\vec{v}$ into STORE1.

2. The module MAT reads $\vec{v}$ out of STORE1 and the principal-axis transformation matrix $B$ out of STORE2. This matrix is available from the adaptation phase through the execution of a standard principal-axis transformation over the feature quantity $\{\vec{v}\}$. The matrix and vector are multiplied. The result is the transformed image vector $\vec{w}$, which is stored in STORE1.

3. The module POLY reads the vector $\vec{w}$ out of STORE1 and a list PSL1 out of STORE2. PSL1 contains the control information about how the components of the vector $\vec{w}$ are to be linked. The x vector is generated and is stored in STORE1.

4. The module MAT reads the x vector out of STORE1 and the matrix elements of the RNK, which are stored by class as vectors $\vec{A}(i_a, j_a)$. MAT now forms a scalar product with the x vector for each A vector. The number of scalar products is identical to the number of present A vectors. The scalar products are transferred into the d vector in the sequence of the A vectors. MAT stores the d vector in STORE1.

5. The module POLY reads the d vector out of STORE1 and the list PSL2 out of STORE2. POLY then constructs the X vector by applying PSL2 to the d vector, and stores the X vector in STORE1.

6. The module MAT reads the X vector out of STORE1, this time reading the A matrix of the assessment classifier out of STORE2. This matrix contains as many A vectors as there are classes estimated by the assessment classifier. After the reading-in process, MAT forms a scalar product with the X vector for each A vector. The scalar products are incorporated into the D vector in the sequence of the A vectors. MAT writes the D vector into STORE1.
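The sketch below restates steps 1-6 as a single function. The NORM stage is passed in as a callable because its normalization rules are hardware- and application-specific; the modules and memories of FIG. 4 appear here only as ordinary function arguments:

```python
def recognize(u, norm, B, psl1, RNK, psl2, A_assess):
    """Functional sketch of the FIG. 4 recognition loop, steps 1-6.
    u:        binary image vector (step 1 input)
    norm:     NORM module as a callable, binary image -> gray-image v vector
    B:        principal-axis transformation matrix
    psl1/2:   polynomial linking rules PSL1/PSL2 as callables
    RNK:      reduced pair-classifier matrix, one A(i_a, j_a) row per pair
    A_assess: assessment-classifier matrix, one A vector row per class"""
    v = norm(u)          # step 1: NORM
    w = B @ v            # step 2: MAT, principal-axis transformation
    x = psl1(w)          # step 3: POLY
    d = RNK @ x          # step 4: MAT, pair estimates in pair order
    X = psl2(d)          # step 5: POLY
    D = A_assess @ X     # step 6: MAT, one estimate per class in [0, 1]
    return D
```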

After this loop is complete, the D vector, which includes an estimate in the number interval [0, 1] for each of the K classes of the discrimination problem, is available in STORE1 as a result. It is now the decision of a post-processing module whether to accept or reject the assessment of the classifier corresponding to the maximum estimate. Support of the post-processing is assured by the classification in that the continuous estimation-vector spectrum of the assessment classifier is known from the adaptation phase; from this spectrum, thresholds for rejection or acceptance of an estimate, which are statistically guaranteed with the use of a cost model, can be derived. If the estimation-vector statistics are recorded in cycles during reading operation, statistical prognoses can be derived through a dynamic readjustment of thresholds.
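A minimal sketch of such a post-processing decision, assuming a single acceptance threshold already derived from the estimation-vector spectrum (the cost model itself is outside this sketch):

```python
import numpy as np

def accept_or_reject(D, threshold):
    """Accept the maximum-estimate class only if it clears the threshold
    derived from the estimation-vector spectrum; otherwise reject."""
    best = int(np.argmax(D))
    return best if D[best] >= threshold else None   # None signals rejection
```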

Application to general alphanumeric recognition:

In the recognition of a general class set comprising 10 numerals, 30 capital letters, 30 lowercase letters and 20 special characters, a total of at least 90 classes is to be discriminated. If it were desired to solve the recognition problem with a complete network classifier, 4005 pair classifiers would correspond to these 90 classes. Both the storage capacity and the calculation capacity of conventional reading electronics would thus be overloaded. The problem worsens if, instead of the 90 meaning classes, so-called gestalt classes are inserted for reasons of recognition theory, the gestalt classes representing the respective typical writing forms of the same character class. Based on the gestalt classes, up to 200 classes must be separated.

This problem is solved by training a meta-class classifier comprising a group-network classifier and a group-assessment classifier. The resulting classification system is illustrated in FIG. 5. In this case, the meta-class classifier recognizes groups of characters. A metric calculated on the basis of the moment matrices of the individual gestalt classes clusters the characters into groups. The groups are structured such that similar gestalt classes are in the same group. The moment matrix of a group is formed by weighted averaging over all of the moment matrices of the gestalt classes represented in the group. With, for example, eight groups encompassing a maximum of 30 gestalt classes each, the meta-class classifier itself is realized as a coupling of a network classifier with an assessment classifier generated according to the invention.

Consequently, one pair classifier is formed for each group pair. The group-assessment classifier is then based on 28 pair classifiers $\vec{g}_{IJ}$. A group-assessment classifier $\vec{G}_I$, which estimates on the eight groups as classes, is trained on these pair classifiers. In the next step, for each group of gestalt classes, a reduced set of pair classifiers $\vec{a}_{i_I, j_I}$ with a corresponding assessment classifier $\vec{A}_{J_I}$ is trained according to the described method. In the reading phase, the group-assessment classifier first decides to which group the relevant character belongs. Subsequently, the reduced character network classifier corresponding to the discriminated group(s) is activated as a function of thresholds. This classifier generates an estimated value for each gestalt class represented in the group. Thresholds $\tau_I$ are adjusted for each group on a statistical basis; these thresholds regulate from which estimate quality $D_I$ of the group-assessment classifier the corresponding reduced character network classifier $\vec{a}_{i_I, j_I}$ is to be activated. The thresholds can be set such that at least one of the reduced character network classifiers will reliably have the correct recognition result with the least-possible calculation effort. If results of numerous character-assessment classifiers are present, a normalized total result is formed in that all estimates of an activated character-assessment classifier $\vec{A}_{J_I}$ are multiplied by the estimate of the corresponding group-assessment classifier $\vec{G}_I$, while the estimated values of non-activated character network classifiers are pre-set with 1/(number of classes) prior to multiplication with the group estimate. With this classification expansion, it is possible to perform general alphanumeric recognition tasks completely on the basis of the method of the invention.
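The combination rule of the last paragraph can be sketched as follows, assuming per-group character classifiers as callables and a vector of group estimates (interface names are illustrative):

```python
import numpy as np

def total_estimate(G, group_classifiers, x, tau):
    """Normalized total result per the FIG. 5 scheme.
    G:                 group-assessment estimates D_I, one per group
    group_classifiers: list of (classifier_fn, n_classes), one per group
    x:                 expanded input vector for the character networks
    tau:               activation thresholds tau_I, one per group"""
    parts = []
    for D_I, (clf, n), t in zip(G, group_classifiers, tau):
        if D_I >= t:
            est = clf(x)                    # activated character network
        else:
            est = np.full(n, 1.0 / n)       # preset with 1/(number of classes)
        parts.append(D_I * est)             # multiply by the group estimate
    total = np.concatenate(parts)
    return total / total.sum()              # normalize the sum to 1
```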

Further possible solutions:

1. Expansion to multi-class discriminators as base-class classifiers

 The method is not limited in principle to the adaptation of the assessment classifier by way of an ensemble of two-class discriminators; rather, it can be expanded to ensembles of multi-class discriminators. Hence, in an ensemble comprising n-class discriminators, the n-th class can represent the complement class to the remaining n−1 classes of the same discriminator in relation to all present classes.

2. Cascading system with classifiers of the invention: The practice of automatic reading operation shows that, with constant reading output, a considerable reduction in outlay is possible in that a control structure only calls the more elaborately structured recognition modules if a more weakly structured module has already attained an insufficient estimate quality for a present pattern. The classifiers according to the invention are incorporated seamlessly into this strategy. A system of classifiers according to the invention is constructed that builds in cascading fashion on one and the same base classifier. The (n+1)-th assessment classifier is trained with the patterns that have been estimated by the n-th assessment classifier as having an insufficient quality. In practical reading operation, this cascade generally breaks off after the first assessment classifier, but is continued for rare and difficult patterns, and thus increases the reading rate with a low increase in the calculation load.
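A minimal sketch of this cascade at reading time, assuming the assessment classifiers are callables ordered from weakest to strongest and a fixed quality threshold:

```python
def cascade_classify(x, assessment_classifiers, quality_threshold):
    """Consult each assessment classifier in turn; stop as soon as the best
    estimate is of sufficient quality (the usual case after stage one)."""
    D = None
    for clf in assessment_classifiers:
        D = clf(x)
        if D.max() >= quality_threshold:
            break                # cascade breaks off here
    return D
```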

What is claimed is:
 1. A method of pattern recognition on the basis of statistics, which, for an object to be recognized, estimates an association of each target class of a class set with a numerical value on the basis of a complete ensemble of two- or multi-class classifiers, the numerical value resulting from cascaded application of polynomial classifiers, comprising the steps of: selecting the two- or multi-class classifiers whose estimates contribute the most, with high separation relevance, to minimizing a scalar measure calculated over an estimation-vector spectrum, from all two- or multi-class classifiers, over their estimation-vector spectrum on a learning sample in which all class patterns to be recognized are represented sufficiently; using the selected two- or multi-class classifiers to form estimate vectors over an expanded learning sample; using said estimate vectors to generate feature vectors that have been expanded through polynomial linking; and on the basis of said feature vectors, calculating an assessment classifier for estimating onto all target classes.
 2. The method according to claim 1, wherein said step of selecting includes the step of using a sum of error squares in a discrimination space as said scalar measure.
 3. The method according to claim 1, wherein said step of selecting includes the step of using entropy in a distribution space as said scalar measure, wherein frequency of appearance of each feature state of the two- or multi-class classifier estimates is determined over all feature states.
 4. The method according to claim 1, further comprising the steps of: dividing a large target-class quantity into a plurality of target-class quantities, for which said selecting of the two- or multi-class classifiers is performed, and wherein, from this, the assessment classifier is generated; and determining a resulting total estimate from results of the assessment classifiers.
 5. The method according to claim 4, wherein said determining step includes the step of forming a Cartesian-expanded product vector from result vectors of the assessment classifiers, from which Cartesian-expanded product vector a quadratic assessment classifier that determines the total estimate is formed.
 6. The pattern-recognition method according to claim 4, wherein said determining step includes the steps of forming a Cartesian-expanded product vector from result vectors of the assessment classifiers; transferring said Cartesian-expanded product vector into a transformed vector by means of a subspace transformation using a transformation matrix having a corresponding eigenvalue distribution; adapting a quadratic classifier using only the most critical components of said transformed vector, which components correspond to the eigenvalue distribution of the transformation matrix; and using the adapted quadratic classifier, mapping the transformed and reduced vector onto target classes for an estimated value.
 7. The method according to claim 4, further comprising the step of, prior to activation of the step of selecting the two- or multi-class classifiers, a meta-class classifier that is trained over groups of class quantities generates estimates over the groups, which contain characters; and wherein: said step of selecting includes the step of activating the two- or multi-class classifiers for the characters of the groups whose estimated values lie above an established threshold; and said determining step includes the step of linking the group estimates to the estimates of the respectively-associated character-assessment classifiers for the character classes contained in the respective group according to a unified rule such that the sum over all character estimates linked in this manner yields a number that can be normalized to 1.
 8. An arrangement for pattern recognition on the basis of statistics, which, for an object to be recognized, estimates an association of each target class of a class set with a numerical value on the basis of a complete ensemble of two- or multi-class classifiers, the numerical value resulting from cascaded application of polynomial classifiers, comprising: means for statistically-optimized selection of the two- or multi-class classifiers whose estimates contribute the most, with high separation relevance, to minimizing a scalar measure calculated over an estimation-vector spectrum, from a complete ensemble of all two- or multi-class classifiers based on their estimation-vector spectrum over a learning sample in which all class patterns to be recognized are represented to a sufficient extent; means for generating polynomial-expanded feature vectors that represent the order of the system of selected two- or multi-class classifiers, said polynomial-expanded feature vectors being formed over an expanded learning sample, said expanding carried out by means of polynomial linking; and means for calculating an assessment classifier that uses the feature vectors of the system of selected two- or multi-class classifiers to calculate an estimation vector that, for each target class, contains a numerical value as an approximated conclusion probability regarding the association of a classified pattern with said class of patterns.
 9. A method of pattern recognition for estimating input characters using probabilities of their memberships in n character classes, comprising the steps of: performing an optimal selection procedure, the optimal selection procedure determining from a given system of N statistically non-independent classifiers a set of K classifiers that contribute the most to minimizing a scalar classification measure for all comparable subsystems of fixed dimension K, where K<N; adjusting K such that a linear K-dimensional output vector of the subsystem can be polynomially extended to at least quadratic terms below a predetermined magnitude of terms; and constructing an optimal assessment classifier that uses the polynomially-extended output vector of the optimal subsystem as its input and maps it to a final probability vector for the n classes as its output. 